Palantir Technologies

VAST 2010 Challenge
Hospitalization Records -  Characterization of Pandemic Spread

Authors and Affiliations:

Palantir Technologies – VAST10 Team
Brandon Wright, Palantir Technologies, bwright@palantirtech.com

Jesse Rickard, Palantir Technologies

Alex Polit, Palantir Technologies

Jason Payne, Palantir Technologies

Tool(s):

Overview: Palantir Horizon is part of Palantir’s approach for big data: rapid, interactive analysis of datasets that contain billions of records. Project Horizon was developed as a Palantir “Hack Day” project on top of the Palantir platform; it empowers analysts to start with their entire ecosystem of data (literally billions of rows of data), and iteratively pare the data down to discover the proverbial needle in the haystack.

Horizon

http://www.palantirtech.com/horizon

Background: Palantir is operational today at many of the most prestigious intelligence, defense, law enforcement, and regulation/oversight organizations in the world. Palantir was put together by the founders of PayPal, capitalizing on the lessons learned by their anti-fraud department. Facing highly coordinated cyber attacks in order to commit payment fraud and exploit sensitive consumer information, an entirely new approach was required. Existing technology was poorly suited to dealing with sparse, cyber-specific data. To defeat the international fraud rings, high level conceptual access to the data was required. The analyst-driven intelligence analysis tools that eventually became the Palantir platform were a direct outgrowth of this effort.

Company Web site:
http://www.palantirtech.com

Check out our Analysis Blog to see more analysis using Palantir: http://www.palantirtech.com/government/analysis-blog.

Video:

 

MC2.wmv

 

ANSWERS:


MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

We used Palantir’s Horizon platform to analyze the spread of the disease.  Our approach involved visualizing the key macro characteristics of the entire dataset of approximately 15 million hospitalization records, and iteratively deriving key subsets and properties in order to produce more granular visualizations. 

In terms of time commitment, our team parsed all symptoms manually in Excel and Access, which took several hours.  Importing the data into Palantir took several minutes.  Actual workflows proceeded essentially as quickly as the analyst could think of ways to view the data.  Each custom view took anywhere from 0.5 to 3 seconds to generate.  Total analysis time was approximately 2.5 hours of a single analyst creating near-instantaneous views of the data to understand the behavior patterns of the outbreak.

Our analysis of Mini Challenge 2.1 comprised two main sections.  We began by performing analysis of temporal patterns, which allowed us to isolate subsets of likely disease-related deaths vs. non-disease-related deaths among total hospitalizations.  From there, we performed statistical analyses of the prevalence of various symptoms for each subset as well as the correlation of those symptoms with mortality rates.

Temporal Patterns:

We used the “date” property of the entire data set to plot hospital admissions on a timeline.  No clear pattern emerged, although hospitalizations showed a broad peak around the middle of May.  However, once we drilled down on hospitalizations resulting in death, the timeline revealed a strong pattern, with a peak in hospitalizations occurring on the 16th of May:

Screenshot

Figure MC2.1.1: Timeline of hospitalizations.  The first timeline represents total hospitalizations by date, while the second represents only those hospitalizations resulting in death.

 

In order to isolate likely disease-related deaths, we wanted to determine how long it took patients to die after being hospitalized.  We accomplished this by deriving a new property, “days til death”, which subtracts the hospitalization date from the death date and converts the result from milliseconds to days, according to the formula [(${death date} - $date) / 86400000].
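As a minimal sketch of the same derivation outside Horizon, assuming both dates are stored as epoch timestamps in milliseconds (which the divisor 86400000, the number of milliseconds in a day, suggests), the property could be computed as follows. The function names and sample dates are hypothetical and are not taken from the challenge data.

```python
from datetime import datetime, timezone

MS_PER_DAY = 86_400_000  # 24 h * 60 min * 60 s * 1000 ms


def days_til_death(date_ms, death_date_ms):
    """Days from hospitalization to death; None if no death date is recorded."""
    if death_date_ms is None:
        return None
    return (death_date_ms - date_ms) / MS_PER_DAY


def to_ms(year, month, day):
    """Example helper: midnight UTC on the given day, as epoch milliseconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


admitted = to_ms(2009, 5, 16)  # hypothetical admission date
died = to_ms(2009, 5, 24)      # hypothetical death date
print(days_til_death(admitted, died))  # 8.0
```

Records without a death date are returned as None here so that they can be excluded from any later histogram of the property.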

We then produced a property value histogram of the “days til death” property, which revealed that almost all deaths (96.9%) occurred 8 days after hospitalization:

 

Screenshot

Figure MC2.1.2: Property value histogram showing distribution of cases by “days til death” property.
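A property value histogram of this kind amounts to counting how often each value occurs and expressing each count as a share of the total. The sketch below shows that calculation on a small made-up sample; it is an illustration under that assumption and not Horizon's implementation.

```python
from collections import Counter


def value_histogram(values):
    """Return {value: (count, share)} over the non-missing values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return {v: (n, n / total) for v, n in sorted(counts.items())}


# Made-up "days til death" values; None marks patients who did not die.
sample = [8, 8, 8, 2, 8, None, 8, 5, 8, 8, None, 8]
hist = value_histogram(sample)
for days, (n, share) in hist.items():
    print(f"{days:>2} days: {n:>2} deaths ({share:.1%})")
```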

 

Statistical Patterns:

It should be acknowledged that it is difficult to distinguish hospitalizations related to the disease from those that are not. However, it is easier to distinguish disease-related deaths, since they occurred 8 days after hospitalization. 

We generated property value histograms for the property “symptoms” for all hospitalizations; hospitalizations resulting in death from the disease (death after 8 days); and all other hospitalizations resulting in death.  The key symptoms for the disease are Vomiting, Abdominal Pain, Nose Bleed, Back Pain and Diarrhea (Fever is also common in disease cases, but only slightly more so than in non-disease cases – 9.6% vs. 8.0%).

 

Screenshot

Figure MC2.1.3: A summary view of cases in which “days til death” = 8, viewed alongside a histogram of symptoms for those cases.

 

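The following lists provide a breakdown of the top symptoms for each subset of cases, in the order given above: all hospitalizations; hospitalizations resulting in death from the disease (death after 8 days); and all other hospitalizations resulting in death: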

Vomiting 10.5%
Fever 8.3%
Abdominal Pain 8.1%
Diarrhea 4.3%
Back Pain 4.1%
Headache 3.0%
Rash 2.7%
Blurred Vision 1.7%
Cough 1.7%
Swelling 1.6%
Nose Bleed 1.2% 

Vomiting 29.6%
Abdominal Pain 21.4%
Diarrhea 12.4%
Back Pain 10.3%
Fever 9.6%
Swelling 3.4%
Nose Bleed 3.4%
Neck Pain 1.4%
Headache 1.4%
Blurred Vision 1.4%
Tremors 0.7%
Hearing Loss 0.7%
Abnormal Labs 0.7%
Nausea 0.7%
Proteinuria 0.7%
Leg Pain 0.7%
Rash 0.7%
Conjunctivitis 0.7%

Fever 8.0%
Vomiting 4.9%
Abdominal Pain 4.1%
Headache 3.6%
Rash 3.3%
Cough 2.4%
Back Pain 2.2%
Diarrhea 2.1%
Blurred Vision 1.8%
Chest Pain 1.2%
Dizziness 1.1%
Nausea 1.1%
Weakness 1.1%
Swelling 1.1%
Shortness Of Breath 1.1%
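A breakdown like the lists above can be reproduced by filtering the records into a subset and computing each symptom's share within it. The following is a minimal sketch; the record layout (a "symptom" field and a "days_til_death" field) and the sample records are assumptions made for illustration.

```python
from collections import Counter


def symptom_shares(records, keep):
    """Share of each symptom among the records for which keep(record) is true."""
    subset = [r for r in records if keep(r)]
    counts = Counter(r["symptom"] for r in subset)
    return {s: n / len(subset) for s, n in counts.most_common()}


# Made-up records; days_til_death is None when the patient did not die.
records = [
    {"symptom": "Vomiting", "days_til_death": 8},
    {"symptom": "Vomiting", "days_til_death": 8},
    {"symptom": "Fever", "days_til_death": 8},
    {"symptom": "Abdominal Pain", "days_til_death": 8},
    {"symptom": "Fever", "days_til_death": 3},
    {"symptom": "Cough", "days_til_death": None},
]

all_cases = symptom_shares(records, lambda r: True)
disease_deaths = symptom_shares(records, lambda r: r["days_til_death"] == 8)
other_deaths = symptom_shares(records, lambda r: r["days_til_death"] not in (None, 8))

for name, shares in [("all", all_cases), ("8-day deaths", disease_deaths), ("other deaths", other_deaths)]:
    print(name, {s: f"{p:.1%}" for s, p in shares.items()})
```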

To determine which of these symptoms are most likely to be fatal, we generated a “Pocket Histogram” in order to view property values positively correlated with death.  Among total hospitalizations, a number of symptoms show a positive correlation with death, most strongly Tremors:

 

Screenshot

Figure MC2.1.4:  Pocket Histogram showing correlation of various symptoms with death among total hospitalizations.

 

We noted the following correlations among both total hospitalizations and those resulting in death after 8 days:

·         Symptoms Positively Associated With Death (within total hospitalizations):

symptoms='Tremors' / 4.22 / 3901
symptoms='Hearing Loss' / 4.06 / 3755
symptoms='Abnormal Labs' / 4.06 / 3753
symptoms='Proteinuria' / 4.01 / 3707
symptoms='Conjunctivitis' / 3.93 / 3652
symptoms='Vomiting' / 3.59 / 93214
symptoms='Nose Bleed' / 3.32 / 18453
symptoms='Diarrhea' / 3.24 / 3771
symptoms='Abdominal Pain' / 3.18 / 112013
symptoms='Back Pain' / 2.91 / 48568
symptoms='Swelling' / 2.01 / 11126
symptoms='Pregnant' / 1.34 / 213
symptoms='Fever' / 1.29 / 41781
symptoms='Vaginal Problems' / 1.18 / 645

·         Symptoms positively associated with 8-day deaths (i.e. deaths from the disease):

symptoms='Hearing Loss' / 1.03 / 3750
symptoms='Abnormal Labs' / 1.03 / 3747
symptoms='Proteinuria' / 1.03 / 3701
symptoms='Tremors' / 1.03 / 3891
symptoms='Vomiting' / 1.03 / 92899
symptoms='Conjunctivitis' / 1.03 / 3639
symptoms='Abdominal Pain' / 1.03 / 111502
symptoms='Diarrhea' / 1.03 / 3751
symptoms='Nose Bleed' / 1.03 / 18351
symptoms='Back Pain' / 1.03 / 48289
symptoms='Pregnant' / 1.02 / 211
symptoms='Vaginal Problems' / 1.02 / 638
symptoms='Swelling' / 1.02 / 10999
symptoms='Fever' / 1.01 / 40806
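The submission does not state how the Pocket Histogram scores the association between a property value and death. One common measure of this kind is lift: the death rate among records with a given symptom divided by the death rate among all records, so that values above 1 indicate a positive association. The sketch below illustrates lift on made-up records; it is an assumption offered for illustration and not a description of Horizon's calculation.

```python
def lift(records, has_property, in_target):
    """P(target | property) / P(target) over the given records."""
    with_prop = [r for r in records if has_property(r)]
    base_rate = sum(1 for r in records if in_target(r)) / len(records)
    cond_rate = sum(1 for r in with_prop if in_target(r)) / len(with_prop)
    return cond_rate / base_rate


# Made-up records: 2 of 8 patients died overall, 2 of the 4 with Tremors died.
records = (
    [{"symptom": "Tremors", "died": True}] * 2
    + [{"symptom": "Tremors", "died": False}] * 2
    + [{"symptom": "Cough", "died": False}] * 4
)

tremors_lift = lift(
    records,
    has_property=lambda r: r["symptom"] == "Tremors",
    in_target=lambda r: r["died"],
)
print(tremors_lift)  # 2.0
```

Under this measure, a symptom whose patients die at exactly the overall rate scores 1, and a symptom with no deaths scores 0.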

Additionally, the pocket histogram for total hospitalizations indicates that chances of death are fairly similar for each city with the exceptions of Nonthaburi, Thailand and Mersin, Turkey, both of which have extremely low death rates.  This may indicate that the disease did not spread to Thailand or Turkey:

 

Screenshot

Figure MC2.1.5: Correlation of various locations with death among overall hospitalizations.

 

Mortality Rates:

2.5% of overall hospitalizations resulted in death.  2.2% of overall hospitalizations resulted in death that appeared related to the disease.  Mortality rates varied slightly between locations (Turkey and Thailand are extremely low, as they were most likely unaffected by the disease).

 


MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

Timing of Outbreak: 

In order to create a statistical reference point, we generated the following table using figures derived from time histograms of death dates (both overall and disease-related) for each location.

 

 

 

Location      First death  First seemingly disease-related death  Initial small jump in deaths  Big jump in deaths  Peak       Last death
Nairobi       4/27/2009    4/27/2009                              5/2/2009                      5/4/2009            5/22/2009  6/24/2009
Aleppo        4/28/2009    4/28/2009                              5/4/2009                      5/5/2009            5/23/2009  6/30/2009
Yemen         4/29/2009    4/29/2009                              5/4/2009                      5/6/2009            5/24/2009  6/30/2009
Lebanon       4/29/2009    4/29/2009                              5/4/2009                      5/6/2009            5/25/2009  6/26/2009
Karachi       4/30/2009    4/30/2009                              5/6/2009                      5/6/2009            5/25/2009  6/29/2009
Saudi Arabia  4/24/2009    5/2/2009                               5/5/2009                      5/7/2009            5/25/2009  6/28/2009
Venezuela     5/1/2009     5/1/2009                               5/5/2009                      5/8/2009            5/27/2009  6/28/2009
Iran          5/2/2009     5/2/2009                               5/6/2009                      5/8/2009            5/27/2009  6/29/2009
Colombia      5/2/2009     5/2/2009                               5/6/2009                      5/9/2009            5/28/2009  6/30/2009

 

The first seemingly disease-related death for each location fell within the period from 4/27/2009 to 5/2/2009.  In most cases, the first death and the first seemingly disease-related death occurred on the same date.  However, in the case of Saudi Arabia, the first death overall occurred on 4/24/2009, 8 days before the first disease-related death.  An initial small jump in deaths occurred 3-6 days after the first disease-related death in each location (3 days for Saudi Arabia, 4-6 days elsewhere).  A bigger jump occurred anywhere from 0-3 days after the initial jump (both jumps occurred on the same day in the case of Karachi only).  For all locations, deaths peaked approximately 25 days after the first disease-related death, then subsided over the next month or so.
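The intervals discussed here can be re-derived from the table with ordinary date arithmetic. The sketch below transcribes the first five date columns of the table and computes the gaps for each location; the dictionary layout and function names are chosen for illustration only.

```python
from datetime import datetime


def d(text):
    """Parse a date written as month/day/year, e.g. 4/27/2009."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# (first death, first disease-related death, small jump, big jump, peak)
TABLE = {
    "Nairobi": ("4/27/2009", "4/27/2009", "5/2/2009", "5/4/2009", "5/22/2009"),
    "Aleppo": ("4/28/2009", "4/28/2009", "5/4/2009", "5/5/2009", "5/23/2009"),
    "Yemen": ("4/29/2009", "4/29/2009", "5/4/2009", "5/6/2009", "5/24/2009"),
    "Lebanon": ("4/29/2009", "4/29/2009", "5/4/2009", "5/6/2009", "5/25/2009"),
    "Karachi": ("4/30/2009", "4/30/2009", "5/6/2009", "5/6/2009", "5/25/2009"),
    "Saudi Arabia": ("4/24/2009", "5/2/2009", "5/5/2009", "5/7/2009", "5/25/2009"),
    "Venezuela": ("5/1/2009", "5/1/2009", "5/5/2009", "5/8/2009", "5/27/2009"),
    "Iran": ("5/2/2009", "5/2/2009", "5/6/2009", "5/8/2009", "5/27/2009"),
    "Colombia": ("5/2/2009", "5/2/2009", "5/6/2009", "5/9/2009", "5/28/2009"),
}


def gaps(row):
    """Day counts between the milestones of one table row."""
    first, first_disease, small, big, peak = map(d, row)
    return {
        "first_to_disease": (first_disease - first).days,
        "disease_to_small": (small - first_disease).days,
        "small_to_big": (big - small).days,
        "disease_to_peak": (peak - first_disease).days,
    }


for location, row in TABLE.items():
    print(f"{location:<13}", gaps(row))
```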

 

Infections By Location: Relative and Total

In order to visualize the relative geographic concentration of disease-related deaths, we generated a heatmap for all locations.  Note that Turkey and Thailand are deep blue, indicating few deaths.

 

Screenshot

Figure MC2.2.1: Heatmap of disease-related deaths by outbreak location.

 

We can also view total disease-related deaths per location using a property value histogram based on the “location” property for each death:

 

Screenshot

Figure MC2.2.2: Total deaths from disease by location.

 

Recovery Ability:

As noted above, in each location, deaths peaked approximately 25 days after the first disease-related death, then began dropping, finally tapering off approximately a month after the peak.  This can be seen in the above table, and is also illustrated in this scattergram of daily deaths by location:

 

Screenshot

Figure MC2.2.3:  Scattergram of deaths.  Y= location, X=day number, scale=percentile.

 

In general, the timelines of the disease for each location are extremely similar, whether we were looking at the date of hospitalization or the date of death.  The only major difference was found in the dates the disease first appeared in each location.  Below is a figure comparing the date of death timelines for Karachi, Pakistan (165,605 total deaths) and Aden, Yemen (7,711 total deaths):

 

Screenshot

Figure MC2.2.4: Time histograms showing dates of death from disease for Karachi, Pakistan, and Aden, Yemen.

 

Anomalies:

The most obvious anomaly was the fact that Turkey and Thailand were seemingly unaffected by the disease; however, we lacked sufficient data to form a hypothesis as to why this would be the case.

Another notable anomaly concerned the ages of those hospitalized.  Histograms of age for hospitalizations resulting in death, hospitalizations resulting in death within 8 days, and all other deaths each formed an almost-perfect bell curve, with a peak age between 43 and 45.  This is an anomaly, since the 0-40 range usually forms the bulk of most populations, yet here it comprises a relatively small percentage of hospital visits.  This seems to indicate that people younger than 43-45 are less susceptible to the disease, and that the closer the patient’s age is to zero, the less susceptible he/she is:

 

Screenshot

Figure MC2.2.5:  Time histogram showing the age distribution of patients thought to have died of the disease (8 days after being hospitalized).